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Estimating the probability distribution q governing the behaviour of a certain variable by sampling 
its value a finite number of times most typically involves an error. Successive measurements allow 
the construction of a histogram, or frequency count f , of each of the possible outcomes. In this 
work, the probability that the true distribution be q, given that the frequency count f was sampled, 
is studied. Such a probability may be written as a Gibbs distribution. A thermodynamic potential, 
which allows an easy evaluation of the mean Kullback-Leibler divergence between the true and 
measured distribution, is defined. For a large number of samples, the expectation value of any 
function of q is expanded in powers of the inverse number of samples. As an example, the moments, 
the entropy and the mutual information are analyzed. 
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I. ESTIMATING PROBABILITIES FROM 
EXPERIMENTAL FREQUENCIES 

The estimation of probability distributions from a lim- 
ited number of samples typically involves an error. Con- 
sider, for example, a random variable that can be either 
or 1, both values with probability 1/2. An experimenter 
measures the variable, say, four times. If no (similarly, 
ni) is the number of trials the result was (correspond- 
ingly, 1), the possible outcomes are no = j,n± = 4 — j, 
where j may vary between and 4. Each of those possi- 
bilities has probability 3/2j!(4 — j)l of occurring. If the 
experimenter estimates the underlying probability from 
the frequencies, his or her claim will be that the prob- 
ability of getting a zero is no/4. However, in view that 
no depends on the particular outcome of the four trials, 
only a fraction 3/16 of the times will this procedure give 
the correct result, that is /o = 9o = 1/2- 

In the above example, there are three probability dis- 
tributions involved. First, there is the true underlying 
probability q, actually governing the outcome of the ex- 
periment. In vector notation, q = ((70,91), and in the 
particular instance above, q = (1/2,1/2). Then, there 
is the frequency count f = (/o, /1), where fi is obtained 
by dividing rn by the total number of measurements N 
(four, in the example). And finally, there is the proba- 
bility that f = q. To define this last probability, one has 
to consider all possible samples of N trials, and evaluate 
how often the condition f = q is fulfilled. 

More generally, one can define the probability of mea- 
suring a particular f, while the underlying q remains 
fixed. This means to consider a probability distribution 
of all the possible frequency counts. The independent 
variable is the vector f, which varies in a discrete set, 
and the dependent variable is p(i |q). 

The frequency count f is an estimation of the under- 
lying q. In many applications, however, one is interested 



not quite in q, but rather in some function of q. Treves 
and Panzeri [j], for example, have quantified the mean 
error that an experimenter makes when evaluating the 
mutual information in the frequency count f , as an ap- 
proximation to that in the true (and unknown) q. Their 
analysis was made in the same spirit as above, that is, 
they have considered q fixed, while the value of f de- 
pended on the particular outcome of N measurements. 
They have obtained a clean analytical result, under an 
independence approximation. Their approach may be 
naturally generalized to situations where q is a probabil- 
ity density, that is, varies in a continuous set 

However, what the experimenter knows is not the true 
q, but one particular f, obtained after TV observations. 
His or her aim is to estimate the most probable value 
of q (or of some function of q) from the knowledge of 
f . More generally, the experimenter may be interested 
in the whole distribution P(q|f), that is, the probability 
that the true distribution be q, given that he or she has 
measured f . This means to settle the problem the other 
way round as was studied by Treves and Panzeri, and in 
the example above. It actually corresponds to Wolpert 
and Wolf's approach H in the estimation of entropies. 

In the following section, the properties of the distribu- 
tion P(q|f) are studied. In Sect. Ill, P(q|f) is written 
as a Gibbs' distribution, where the inverse number of 
samples plays the role of an effective temperature, and 
the Kullback-Leibler divergence between f and q is the 
equivalent of the energy of state q. As a consequence, 
a thermodynamic potential is defined, thus allowing the 
calculation of the mean Kullback-Leibler divergence be- 
tween f and q by simple derivation. This inspires the 
expansion made in Sect. IV, where the expectation value 
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of an arbitrary function of q can be written as a power 
series in the inverse number of samples. The case of the 
entropy, the mutual information, or any moment of the 
distribution q is shown in the examples of Sect. [V|. Next, 
in Sect. VI the analytical results are c onfro nted with 
numerical simulations. Finally, in Sect. [VI | , the main 
results are summarized and discussed. 
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II. THE PROBABILITY DISTRIBUTION FOR 
THE TRUE PROBABILITY DISTRIBUTION 

Consider the random variable X taking values from the 
set x = [x\, ...,xs), with probabilities q = (qi, qg). 
In principle, there is no need that x%, ...,Xs be numerical 
values, it suffices them to be any exclusive and exhaustive 
set of categories. 

An experimenter makes iV observations of the value 
of X and builds a histogram n = (rii, ...,ng), where rii 
is the number of times the outcome was X{. The ex- 
perimenter considers the frequencies f = (/i,...,/s) = 
(jii/N, ns/N) as an estimation of the true underlying 
probability distribution q. If the measurements are taken 
independently, the probability of measuring f given that 
the data are sorted according to q is equal to the prob- 
ability of observing each Xi a number rii of times, that 
is, 



p(f |q) = N\ Ik^r = 



Nl 



■ exp 



(1) 



However, the knowledge the experimenter has at hand is 
f , not q. He or she may therefore wonder what is the 
probability that the true distribution be q, given that 
the outcome of the experiment was f. This means to 
evaluate a probability density P(q|f ), whose independent 
variable q runs over all the possible distributions of the 
data. That is, all vectors in 5R S such that 



i 

< qi < 1, Vi. 



(2) 



The set of all q obeying Eqs. (|J) constitutes the domain 
D where P(q|f) is defined. It is a finite portion of an 
(S— l)-dimensional plane embedded in 5ft s , and is normal 
to the vector (1, 1, 1). 

Notice that since each /j is the ratio of two natural 
numbers, the set of possible frequencies f is discrete. 
The domain T>, on the contrary, contains a continuum 
of distributions q. Consequently, p(i |q) is a probability, 
whereas P(q|f) is a density. 

Bayes' rule states that 



^(q|f) 



p(f|q)P(q) 
P(f) ' 



(3) 



where P(q) is the prior probability distribution for q, 
and 



P(f) 



/ P(f|q) P(q) dS q 



(4) 



Here, dSq is a volume element, in T>. 

The prior P(q) contains all additional pieces of knowl- 
edge about q, apart from the experimental data. Here, 
the assumption is made that there is no a priori knowl- 
edge. However, it turns out to be crucial to specify what 



is it that is not known [^]. A prior that is uniform over 
V, as was used by Wolpert and Wolf j|, is certainly not 
uniform over any non linear function of q, for example 
the log-likclihood. Thus, not knowing anything about q 
implies knowing something about lng, which in turn may 
result in awkward scaling properties. In this work, the 
power prior 



H 



0-1 



(5) 



is repeatedly used, with Z p = y/S[T((3)] s /T(S/3) (no- 
tice that when (3 — > 0, Zp — > \/~S). However, as was 
shown in JBj choosing any of these priors results in a sur- 
prisingly peaked a priori distribution of the possible en- 
tropies. Hence, the choice of the prior is a delicate issue 
and, in any particular application, it should be done care- 
fully. Here, no attempt will be made to instruct on the 
way such a choice should be made, but since the results 
that follow are strongly grounded on Bayesian inference, 
their validity is, at most, asgood as the prior" ||. 
Replacing Eqs. (Q) and ©in Eq. (g), 



pm = 



exp [-JV£>(f,q)]P(q) 



(6) 



where D is the Kullback-Leibler divergence between f 
and q 



D(f,q)=^/ i ln 



fi 



(7) 



and quantifies is the mean information for discriminating 
in favor of f against q, given the data B . The function 
Z reads 



Z= [ dSq P(q) exp[-ATD(f,q) 
Jv 



(8) 



In the remaining of the section, the properties of 
P(q|f) are studied for the particular P/?(q) defined in 
Eq. (Q). In doing so, the integral 

is frequently encountered. Equation (^|) was first derived 
in H , and an alternative proof may be found in the Ap- 
pendix. 

For the priors in Eq. (^|), the function Z Eq. (||) may 
be calculated analytically, and it reads 



Z = exp [NH(f)] y/S 



nf =1 T(Nf k +f3) 



T(N + S0) ' 
where H. is the entropy of a distribution 

S 



W(f) = -^/ i ln/ i . 



(10) 



(11) 
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Thus, replacing Eq. 

^(q|f) = 



in Eq. (§) 



T(N + Sf3) qfh+P- 1 



■T{Nfi + 0)- 



(12) 



The most probable q M = (qf 1 , ..^qg 1 ) is obtained by 
maximizing Eq. ( |l2| ) , under the normalization constrain. 
The result is 



M Nfi + p- 



1 



N + S(f) — 1) 



(13) 



Thus, if P(q) is uniform in T> (J3 = 1), then the most 
probable q is f. With the maximum likelihood prior 
(/3 — > 0), the most probable q is shifted from f to- 
wards lower counts. The Krichevsky-Trofimov estimator 
§(/3 = 1/2) and the Shurmann-Grassberger § /3 = 1/5 
lie in between. 

Using Eq. @ the expectation value of each component 
qi may be calculated, 



<<?*> = 



TV + 5/3' 



(14) 



For the uniform prior /3 = 1, this equation reduces to 
Laplace's estimator of probabilities, first introduced by 
in his Essay on probabilities. In figure [l] the difference 
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FIG. 1: Difference between (qi) and fi, as a function of 
The value of /3 has been set to 1. The three lines correspond 
to N = 3, 6 and 30. Here, X may take 3 values (S — 3). 
When fi < 1/3, the expectation value of qi is larger than the 
measured frequency fi. As N increases, the effect becomes 
less important. 

between (qi) and the frequency count fi is shown, for /3 = 
1. It is seen that when fi is smaller than 1/5, (qi) is larger 
than fi. On the other hand, if fi > 1/5, then (<&) < 
fi. That is, the mean value of qi is displaced from the 
frequency count so as to approach the flat distribution 
1/5. Of course, the larger the number of samples N, the 



smaller the effect. Changing the value of /3 is equivalent 
to re-scaling the vertical axis of figure [ij 

Typically, one wants to make a guess about the true q. 
Here, two possible estimators have been calculated: the 
maximum q M and the mean (q) . By using the maximum, 
one is choosing the value that is most probably correct. 
But of course, eventually one will also make an error. If 
one measures the error as a (q M — q) 2 , and averages it 
with P(q|f), its mean turns out to be larger than if one 
had chosen (q) ||. Hence, although q M is the estimator 
that gives the correct answer most frequently, if one cares 
for the typical size of the errors, (q) is a better choice. 

When using (q) as an estimator, the covariance matrix 
Ey may be of interest. By means of Eq. (^)it is easy to 
show that for i =/= j 



((ft ~ (Qi)){Qj ~ (%))) 
(Nf l + (3)(Nf 3 +f3) 
(N + SP) 2 (N + 5/3+1) 

when N > 5, 



(15) 



fifj 

' N 



whereas for i = j 



Si,; — 



<(ft-(ft)) 2 ) = 

(Nf i + f3)[N(l-f i ) + f3(S-l)} 



(16) 



(N + Sp) 2 (N 

m - 

N 



5/3 + 1) 
when N > 5, 



The negative sign in Eq. (|l^) derives from the normal- 
ization condition: since the sum of all qi is fixed to unity, 
if one of them surpasses its mean, it is to be expected 
that some other component will be below. In contrast, 
Eq. (|l^) shows that Ej, is always positive. 

The expectation value of q Eq. (|l4|) together with the 
covariance matrix Eqs. (|l5| ) and ( |16[ ) are useful to give 
the Gaussian approximation to P(q|f), centered in its 
mean: 



P(q|f) = K exp 



-(q-(q)) 4 E- 1 (q-(q)) 



(17) 



where the super-script t means tr ansp osed, and K is a 
normalization constant. Equation ( |l7| ) is only defined in 
the plane containing T>, normal to the vector (1, 1, 1). 
Actually, E does not have an inverse in the entire space 
3? s , since the direction (1, 1, 1) is one of its eigenvec- 
tors, with eigenvalue equal to zero. However, being E a 
symmetric matrix, it can be diagonalized by an orthog- 
onal basis. Hence, the 5—1 remaining eigenvectors lie 
in the plane containing T>. The restriction of E into that 
subspace is E, and its inverse is the matrix appearing in 
the exponent of Eq. (17). 

In order to normalize the approximation ( |l7j ) an inte- 
gral of a Gaussian function in T> is needed. This is cer- 
tainly not an easy task. If, however, one can assume that 
the distribution is sufficiently peaked so that P(q|f) w 0, 
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for q in the border of T>, then the domain T> can be ex- 
tended to the whole plane normal to (1, 1, 1). In that 
case, K^ 1 = y / 2irTljXj, where Xj are the 5—1 eigenval- 
ues of S. While the calculation of all the Xj is a difficult 
problem, it is quite straightforward to show that when 
N ^> 5, all the Xj are proportional to l/N. Therefore, 
the square root of each eigenvalue is a useful measure of 
the width of P(q|f) in the direction of its eig envector. 

However, the Gaussian approximation ( jl7| ) is not use- 
ful for other purposes, as for instance, calculating mean 
values, since it lacks from analytical expressions as (|^). 
As a consequence, in what follows, the full Eq. (12) is 
used. 

Equation ([)]) allows the evaluation of all moments of 



\9i, 



■ P)T(N + 5/3) 



T(Nfi + P)T{N + 5/3 + fc) ' 



(18) 



Since the moments are the coefficients of the Taylor ex- 
pansion of the Fourier transform of a distribution, the 
single-component distribution reads 



P(*|f) = Pfel/i, 



(19) 



B[Nfi+p,N(l-fi)+l3(S-l)Y 



where B(x,y) = T(x)T(y)/T(x + y). Figure [2] displays 
the distribution P{qi\fi) for three different values of N, 
and (3 = 1. In all cases, when N is large, the distribution 
is symmetrical, and reaches its maximum value in = 
fi = 1/3. In fact, it may be shown analytically that when 
N>1, 



lim PMfA = 



cxp[-( qi - f t ) 2 /2a 2 ] , (20) 



where a = [/j(l - f^/N] 1 / 2 . That is, the distribution 
tends to a Gaussian function centered at the experimen- 
tal frequency, and with a mean dispersion that diminishes 
with the square root of the number of samples. Notice 
that in this limit, P(q|f) does not depend on /3. 

It may be seen in Fig. || that for smaller values of N, 
the distribution is no longer symmetrical. In fact, since 
5 = 2 and /1 = 1/3 < 1/5, the tail in P(qi\fi) extends 
to the right, resulting in a positive (g>j) — fi, as predicted 
by equation (|is|). 



III. THE INVERSE NUMBER OF SAMPLES AS 
AN EFFECTIVE TEMPERATURE 



since then, several times in learning theory (see for exam- 
ple Q). In these cases, when fluctuation where neglected, 
the probability distribution under study had the form of 
Eq. ([|). In the present context, no approximations are 
needed to write down Eq. (j^). 

The exponential factor in (|6|) depends on q and f only 
in the combination D(f,q), diminishing exponentially as 
the divergence between the two distributions grows. Its 
maximum is attained when D = 0. It can be shown Q] 
that for any f and q, D({ , q) > 0, and the equality holds 
only when f = q. 

Defining the thermodynamic potential 



F = - InZ 



it follows that 



OF 

^ = dN' 
a 2 D = (D 2 (D) 2 ) 



d 2 F 

ON 2 ' 



(21) 

(22) 
(23) 



where the mean values ((•)) are defined by 

/ D (-)P(q|f)d5 q . 

For example, when the prior is given by Eq. (jq), 
(D) = W(f) - V(N + 5/3) + J2 fM^fi + 0), ( 24 ) 



where ^(x) = dlnT(a;)/da; is the Digamma function [JUj 
It is easy to show that 

5-1 



lim ID) = 



■0(1/N 2 ). 



(25) 



Here, both N and N fi have been supposed large, for all i. 
Since /, is of the order of 1/5, the above limit holds when 
N ^> S. Equation (^5|) states that for a large number of 
samples, the expected value of the divergence between 
the experimental frequencies and the true distribution 
does not depend on the measured f. It grows linearly 
with the number of items, and decreases as l/N . 
Accordingly, 



D 



= -^(N + 5/3) + ]T ff*\Nfi + /3), 



(26) 



where ^ 1 (x) — d^(a;)/da;, is the first Polygamma Func- 
tion En]. Taking the limit of a large number of samples, 



lim a 



D 



2N 2 



0(1/N 3 ). 



(27) 



In the limit N 3> 5, the mean quadratic dispersion does 
not depend on the measured fi. 



Equation (||) states that P(q|f) is completely analo- 
gous to a Gibbs distribution, where the number of sam- 
ples N plays the role of the inverse of the temperature, 
D(f, q) is the equivalent to the energy of the state q, 
and P(q) is the density of states. This analogy was first 
pointed out in the context of machine learning H, and 



IV. ESTIMATION OF FUNCTIONALS OF q, 
FOR A LARGE NUMBER OF SAMPLES. 

Many times, one is interested in the value of some 
function W(q). For instance, if X takes numerical val- 
ues, W may be the mean X = J^j 2 -*?'- Or, m some 
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other application, W may be the entropy of the distri- 
bution q (see equation (|ll|)). If the set X is the Carte- 
sian product of two other sets X = Z 1 x Z 2 , such that 
Vxi 6 X : Xi = (z\,zl), where z\ € Z 1 and z\ e Z 2 , then 
may be the mutual information I between Z l and Z 2 : 



I = ^q a b In 



Qa.q.b 



(28) 



where 



6 

9.6 = E^ 6 ' 



(29) 



Since q is unknown, an interesting guess for W(q) is 
its Bayesian estimation 



(W) = f W(q)P(q|f), 
Jv 



(30) 



which has the appealing property of minimizing the mean 
square error The zero order guess for (W) is W(f). In 
what follows, a systematic method to improve this value 
is derived. 

In the previous section the expectation value of the di- 
vergence between the true and the measured distribution 
was calculated, as well as the size of the fluctuations, for 
the priors in Eq. (J^j . As the number of samples increases, 
both the expected divergence and the fluctuations dimin- 
ish as 1/JV. Since a small divergence means that the two 
distributions are necessarily very similar, only the q that 
are very near f have a non vanishing probability — for D 
sufficiently small, this argument holds for any definition 
of similarity. 

As a consequence, it is reasonable to expand Vt^q) in 
its Taylor series in the neighborhood of f . Hence, Eq. 
© reads 



(31) 



Since P(q|f) decreases dramatically as q departs from 
f , the higher order terms (large k) in Eq. (pT|) should 
become negligible, at least, for large N. 

In the first place, the mean values of Eq. @ arc 
evaluated for the special case of the power law priors. 
This involves, basically, the computation of integrals in 
V of IT^ =1 (g, — fi) hi , for a set of non negative indexes 
(hi, feo, ...fcs) that sum up to K. This can be done using 
Eq. (Q). Of course, the term k = — that is, the raw 
guess-does not depend on JV. It may be shown that only 
k = 1 and k = 2 are proportional to 1/JV. Specifically, 



(ft - /;) 



~ Sfj) 
N + S(3 
PQ- ~ Sfj) 
^ N 



In the same way , if % ^ j 

((Qi-fi)(^-fj))= (33) 
Nfifj - P \fl + (1 + SP)(SfJ 3 - ft - /,)] 



fifj 

N 



{N + S(3){N + S/3+l) 
when N > S, 



whereas when i — j 

(fe-/«) 2 )= (34) 
NM1 - h) +/3[l+0 + Ml + Sp)(Sfi - 2)] 



(N + Sf3)(N + SP + 1) 
M1-/0 



when N > S. 



Summarizing, to first order in 1/JV, 
(W) w W(i) + 



(35) 



(36) 



dW 



s 

s 



d 2 W 



1 



2^ dq 2 



P(l ~ Sfj) 
N 

- Si) 



+ 



EE 

i—l j<i 



d 2 W 



dqidqj 



N 

fifj 

N ' 



This general formula allows the calculation of the first 
correction of the expectation value of an arbitrary func- 
tion W (q), whenever the prior is given by Eq. (||). 

Now, consider the more general case of an arbitrary 
prior. If .P(q) is not given by Eq. (|5[), then one can still 
proceed as above, but replacing W(q) by the product 
W(q)P(q), and setting p = 1. 



V. EXAMPLES 

Here, the expansion ( |36|) is applied to a few particular 
cases. Wolpert and Wolf M have al read y ca lcula ted the 
first two examples exactly (Subsect. V A and VB ), in the 
particular case of /3 = 1. Their results, once expanded up 
to first order in 1/JV are now compared to Eq. (|36|), for 
verification. The advantage of Eq. (|36|) is that, in con- 
trast to Wolpert and Wolf's approach, it applies to any 
function W. The counterpart, of course, is that it gives 
no more than the first correction to (W). Subsection V C 
deals with the calculation of moments. 



A. The mean value of the entropy 



In the first place, the function W(q) is taken to be the 



entropy H of the distribution q, defined in Eq. (Ill 
q = f . It is easy to verify that dH/dqi = — [1 



for 



when N ^> S. (32) whereas d 2 TL/dqidqj — —Sij/qi, where 5ij is Kroeneker 
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delta function: Sij = 1, if i = j and 5^ = 0, if i ^ j. 
Replacing in Eq. (36) and keeping only up to the first 
order in 1/N one arrives at 



(H) 



5-1 
2N 



(37) 



+ C(l/iV 2 ). (38) 



For the case of j3 = 1, this same expression is obtained 
by expanding the exact result, obtained in B 



< w >[3] = -£ 



Nfj + 1 
N + S 



$ (1) (jV/,+2)- 



i=i 

^(N + S + 1) 



(39) 



where ^-^(x) = dlnr(x)/da; is the Digamma function 



B. The mean value of the mutual information 



since the mean value in Eq. (M) involves the distribu- 
tion P(q|f). The approach inQ, instead, uses p(f|q), 
while the true q is fixed. In the present approach, the 
mean value (J) can be either higher or lower than /(f). 



C. The mean value of functions of X 

Consider a function g : {x\, xs} — > 7Z that maps 
the possible values of X into real numbers. For example, 
if X takes numerical values, then gk can be such that 
gk{xi) = x\. For each such g, another function G : T> — > 
1Z is defined, namely G(q) = J2i 9( x i)li- I n the example 
above, Gk is the Ai-moment of the distribution q. The 
expectation value (G) is easily calculated using Eq. (|3^), 
and reads 



(G)=G(f)(l-^f\+lj29^)- ( 42 ) 



In particular, for the gk considered above, this is the first 
order correction to all moments of q. 



Now W is taken to be the mutual information between 
two sets, as defined by Eq. (p8j). Replacing in Eq. (|3^), 



VI. NUMERICAL SIMULATIONS 



(i) = W (i - I + 

5i5 2 + 1 - Si - S 2 



(40) 



2N 



+ ^Vln 



fab 
fa.f.b 



Where Si and S2 are the number of elements in the sets 
Z 1 and Z 2 . When [3—1, Eq. (HQ) coincides with the 
expansion up to first order in 1/N of the exact result 
derived in 



W[3i = E 



Nf ab 



1 



ab 



E 



E 



N + 


S1S2 




f-SlS; 


Nf a 


+ 5 2 


N + 


S1S2 


\N- 


f SlS; 


Nf.b 


+ s a 



$ (1) (iV + 5iS 2 + 1) 



$ (1) Wa6 + 2)^ 



* (1) (iV/ Q . + 5 2 + l) 



$W(iV/. 6 + Si + l) 



(41) 



The quantities /„. and f.b in Eqs. (^) and are 
defined as in (p^). 

In contrast to the result obtained in (IJ , the first order 
correction to the mutual information does bear a depen- 
dence on the values of the individual probabilities fab- 
There is no conflict, however, between the two results, 



In this section, Eq. ( p6[ ) is confronted to the result 
of numerical simulations. Once again, and just to fol- 
low previous studies, V^(q) is set equal to the mutual 
information. However, in contrast to what was done up 
to now III 0, ||, the simulations are performed strictly 
within the present framework. That is, the measured 
frequency f is kept fixed, and the probability for the true 
q is evaluated. 

The procedure to measure numerically P(q|f) is now 
explained. As before, X takes values in a set of S ele- 
ments. Hence, f and q are S-dimensional vectors. The 
value of f is fixed. The domain T> is discretized into 
a number J of cells. Each cell corresponds to a vector 
q that will be visited by the program. The larger the 
number of cells J, the better the sampling of the do- 
main V. For each one of these cells, the value of X is 
measured iV times. The outcomes are sorted with the 
distribution q of the actual cell. If the frequency count 
thus obtained equals f , the counter of the selected cell 
is increased (there is counter for each cell in T>) . The 
comparison between the frequency count and the (fixed) 
f is done with precision e. The procedure is repeated M 
times (M large) in order to have enough counts. This 
algorithm allows to construct a histogram for the proba- 
bility that a given qgD generates the selected f . 

For simplicity, in the results below the number of trials 
M is the same for all cells. This is equivalent to using 
a uniform prior in T> (/3 = 1). A simulation with a non 
uniform prior can be carried out by choosing a different 
M for each cell. 

The two parameters that determine the precision of the 
simulations are J and e. If Dj is the Kullback-Leibler 
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divergence between two neighboring q cells, whenever 
1/N <C Dj then the only vector q that produces fre- 
quency counts equal to f is q = f . That is, for N suffi- 
ciently large, the discretized system behaves as if N = oo. 
Notice that for large J, two neighboring cells correspond 
to q and q + <Jq, with each Sqi oc J s_1 . Thus, the 
Kullback-Leibzig distance between the two is « S/J 5 " 1 . 
This means that when N reaches J s_1 / S, the simulation 
starts to behave as if N were actually infinite. 

On the other hand, if e is not small enough, one mistak- 
enly counts coincidences with f , just because the criterion 
used in the comparison is too brute. In other words, a 
large e allows that cells q too far away from f do give 
rise to frequency counts equal to f . That is, the system 
behaves as if N where smaller than its actual value. 

The dots in figure show the result of the above pro- 





FIG. 2: Probability distribution P(qi\fi) for the case fi — 
1/3, /3 — 1 and S — 2. Different curves correspond to several 
values of the number of samples N. The full line depicts the 
analytical result Eq. f|19|) , while the dots are the numerical 
simulations (see Sect. \V% . 

cedure, for a single component q±. As observed, there 
is very good agreement with the full line, showing the 
analytical result, Eq. Jl2]). 

To evaluate the expectation value of a certain function, 
one simply needs to calculate the sum 



(W}\ 



numerical 



]T W(q)P(q|f), 



(43) 



oils 



using the P(q|f) obtained with the algorithm explained 
above. Figure depicts the result for the mutual infor- 
mation, with (9 = 1. The dots represent the simulations, 
Eq. (M3), whereas the full line shows the analytical re- 
sult (|40|). The computational time required to evaluate 
P(q|f) increases exponentially with the number of dimen- 
sions S. Hence, in the present comparison it is desirable 
to keep S as small as possible. However, in order to de- 
fine a mutual information two sets Z 1 and Z 2 are needed, 
with Si and 52 elements each. In figure H Si = 2 and 



FIG. 3: Difference between the expectation value of the 
mutual information (/) and the measured /(f), as a func- 
tion of the inverse number of samples 1/N. The /3 = 1 
prior was considered. The full line represents the analyti- 
cal result, Eq. (j40|), and the dots the simulations. In (a), 
fu = fl2 = hi = J22 = 1/4, and 7(f) = 0. For each cell in V, 
30,000 sets of TV samples have been sorted. In (b), hi = 0-4, 
fi2 = 0.1, hi = 0.1, and / 22 = 0.4, so /(f) = 0.192745. For 
each cell in T>, 10,000 sets of N samples have been sorted. 
In both cases, each axis in q space has been divided in 20 
intervals, in order to discretize T>, while the parameter e was 
set to 0.0125. 



S2 = 2, thus making a 3 dimensional domain T>. 

In (a) the selected f had no mutual information: /(f) = 
0. The graph shows that the expectation value of / is 
positive. With the chosen parameters (see the caption 
of the figure), the analytical result (40) coincides exactly 
with the one derived by Treves and Panzeri |IJ], that is, 
(I) = (Si - 1)(S 2 - 1)/2N. Since for /(f) = 0, Eq. 
@ reduces to (I) = SiS 2 + 1 - Si - S 2 /2N, for some 
particular choices of Si and Sj the two expressions may 
coincide. It should be kept in mind, however, that this is 
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just a coincidence, and the two mean values have different 
meanings. 

In contrast, in case (b) the value of /(f) is large (see the 
caption for details). In this case, the simulations confirm 
the phenomenon that was pointed out in the previous 
section, namely, that the expectation value (7) may be 
lower than the measured /(f). 

It may be seen that for large N, all the dots concentrate 
in (J) = 7(f). This is, as pointed out before, due to the 
discretization of T>. If the number of cells J is increased, 
one needs to go to a larger N to find such a saturation. 
On the contrary, for smaller N, the simulated (I) lies 
below its theoretical value. This is a manifestation of 
the finite nature of e, and the phenomenon becomes less 
evident as e is lowered. 



VII. DISCUSSION 

In this work, the probability density P(q|f) for the 
true distribution q given the experimental frequencies f 
is analyzed. Such a density, it is shown, may be writ- 
ten as a Gibbs distribution, where the inverse number 
of samples plays the role of an effective temperature, 
and the Kullback-Leibzig divergence between f and q is 
the equivalent of the energy of state q. Its study is not 
only for academic purposes, but eventually also practical. 
In the ideal situation, it would be valuable to calculate 
P(q|f) while an experiment is being carried out, in order 
to know when the number of samples is already enough. 
The experimenter may thus decide to give an end to the 
sampling process when the width of P(q|f) reaches some 
acceptable value. For example, someone interested in 
measuring the public opinion prior to an election may 
wonder how many subjects need to be polled in order 
to have a reliable estimation of the forthcoming result. 
Many times, however, experiments comes to an end be- 
cause of other factors (a deadline, or a floor in the the 
amount of money, patience or students). An estimation 
of the width of P(q|f) is valuable even in these cases, 
just to provide error bars. 

One possibility is to write down the full P(q|f). How- 
ever, being a function of many variables, this may not 
be very practical. A convenient parameter measuring 
the width of P(q|f) in several directions is the square 
root of the corresponding eigenvalues of S. These have 
been shown to diminish asymptotically as 1/N. From 
the information-theoretical point of view, a more appeal- 
ing parameter is the mean divergence P, and its mean 
quadratic fluctuations. As is shown in Eq. (|2j), for small 
N such a width depends on the value of f . If TV 3> S, 
however, both (P) and <jd become independent of f and 



decrease as 1/JV (Eq. (|25|)). Yet another route is to work 
with the function VF(q) one is interested in. By means of 
Eq. (|3rj|), it is possible to decide whether the term pro- 
portional to 1/N is only a small correction to W(f) or, 
on the contrary, the two terms are comparable. In the 
latter case, more measurements should be carried out. 

Although some of the expressions presented here are 
valid for an arbitrary prior, much of the work deals with 
the particular case ofEq. (§). The use of a prior that is 
essentially a linear combination of functions of the form 
(||) has been proposed j5j, specifically, to be used in the 
inference of entropies. For this case, the partition func- 
tion should be constructed by applying the same linear 
superp osition to Eq. (|To|), and the same holds for Eqs. 
(|lj|-|l|). The calculation of (P) and arj as derivatives of 
P is still valid, whereas Eq. (na) should also be averaged. 

The analysis of P(q|f) carried out in Sect. II, and the 
statistical mechanical description of Sect. [n| are valid 
even for small N. The fact that (P) — > 1/N for large 
N inspires the expansion of (W) of Sect. [IV|. It should 
be clear, nevertheless, that such an expansion is only 
convergent when N ^> S. Actually, Eq. ( |l2|) is the first 
order term in powers of S/N, and there is no reason to 
think that the higher order terms will be negligible, if 
such a condition does not hold. Moreover, it is necessary 
to have Nfo ^> 1 for all i. When N is large enough, one 
can always define the number of categories S as to have 
them all well populated. But for N w S this may well 
not be the case. The consequences may, in fact, be quite 
dramatic. For instance, in the example of the entropy 
(Subsect. V A ) one can explicitly see that fi appears in 
the denominator of Eq. (p7|). In other words, the result 
is meaningless if there are empty categories. 

However, when the condition N S does hold, Eq. 
( |l2| ) may serve to draw non trivial conclusions. For 
instance, it is usually supposed that limited sampling, 
on average, flaws the data introducing false correlations. 
This work shows this is not necessarily the case: limited 
sampling may sometimes, on average, lower the correla- 
tions. This is clear in the simulations of Sect. VI, where 



finite sampling results, in mean, in a downwards bias of 
the mutual information. 
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APPENDIX A: INTEGRATING A POWER 
DISTRIBUTION IN V 

Here, Eq. is derived. An alternative and more 
general line of reasoning may be found in 
The aim is to calculate 



r = 



(Al) 




/ nf =1 d 9i q r 

Jv 

/ dqi q? 1 / ■•■ d 9s q™ s s 

Jo Jo 



where As is a constant ensuring that when all m.j vanish, 
Iq is the volume of T>. The supra-index in 1^ indicates 
the dimension of the vectors m and q. 

If X can only take two values, then 5 = 2. In this 
case, jLlj 

= / d <7i qT 1 / d, ?2 <?2 l2 <5 [^2 (l - qi - 92)] , 

Jo Jo 

= i [ d qi <?r(i-<zi) m2 

A2 Jo 



1 m\\m,2\ 

A2 (mi +7712 + 1)!' 



(A2) 



Now, the hypothesis is made for arbitrary S 
1 nf =1 mi! 



I 



A< 



(A3) 



To p rove it, one proceeds by complete induction. Eq. 
(|A3| ) is assumed true for a given m = (mi, ...,m<j) and 
the aim is to prove it for (m,-, ms+i). Hence 



tS+1 
(m 1 ,...,m s +i) 



Jt> 



S jS-1 



A5+I ("Hi — > m S-l) 

i-Ef=i 





( 1 - E 




^^^^ (mi,...,ms_i),ms+ms+i+l 



ms!m s+ i! 



(ms + ms+i + 1)! 
1 



nf+W 



As+i 



(5 + i)-i + E-Ii 



S+l 



(A4) 

(A5) 
-(A6) 



where Q(x) is Heaviside step function: Q(x) = 1 i f g> 1, 
and ®(x ) = if x < 0. When passing from Eq. (A4) to 
Eq. (A5), use was made of the result (A2). Acco rdingly, 
QA6| ) derives from the inductive hypothesis (A3). Since 
Eq. ( [Af^ ) coincides with (A3) when S is replaced by S+l, 
the hypothesis (A3) is proved true. 
Finally, to determine As one evaluates 

(A7) 



Xs(s-r 



The volume of V is VS/{S - 1)!, as can be verified, once 
again, by complete induction. Then As = 1/y/S. 



